Environmental Sound Classification with Parallel Temporal-spectral Attention
Convolutional neural networks (CNNs) are among the best-performing neural
network architectures for environmental sound classification (ESC). Recently,
temporal attention mechanisms have been used in CNNs to capture useful
information from the relevant time frames for audio classification, especially
for weakly labelled data where the onset and offset times of the sound events
are not annotated. In these methods, however, the inherent spectral
characteristics and variations are not explicitly exploited when obtaining the
deep features. In this paper, we propose a novel parallel temporal-spectral
attention mechanism for CNN to learn discriminative sound representations,
which enhances the temporal and spectral features by capturing the importance
of different time frames and frequency bands. Parallel branches are constructed
to allow temporal attention and spectral attention to be applied respectively
in order to mitigate interference from the segments without the presence of
sound events. The experiments on three ESC datasets and two acoustic scene
classification (ASC) datasets show that our method improves classification
performance and also exhibits robustness to noise.
Comment: submitted to INTERSPEECH202
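To make the mechanism concrete, here is a minimal PyTorch sketch of parallel temporal and spectral attention branches over a CNN feature map; the 1x1 convolution scorers, sigmoid gating, and additive fusion are illustrative assumptions, not the authors' exact design.

import torch
import torch.nn as nn

class ParallelTemporalSpectralAttention(nn.Module):
    """Sketch: gate a CNN feature map with parallel temporal and spectral attention."""

    def __init__(self, channels):
        super().__init__()
        # One attention score per time frame / per frequency band (1x1 convs, assumed)
        self.temporal_scorer = nn.Conv2d(channels, 1, kernel_size=1)
        self.spectral_scorer = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, time, freq)
        # Temporal branch: average out frequency, score each time frame
        t_att = torch.sigmoid(self.temporal_scorer(x.mean(dim=3, keepdim=True)))  # (B,1,T,1)
        # Spectral branch: average out time, score each frequency band
        f_att = torch.sigmoid(self.spectral_scorer(x.mean(dim=2, keepdim=True)))  # (B,1,1,F)
        # Apply both gates in parallel and merge by summation (an assumption)
        return x * t_att + x * f_att

feats = torch.randn(4, 64, 100, 40)             # toy batch of CNN features
out = ParallelTemporalSpectralAttention(64)(feats)
print(out.shape)                                # torch.Size([4, 64, 100, 40])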
Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions
Most existing audio-text retrieval (ATR) methods focus on constructing
contrastive pairs between whole audio clips and complete caption sentences,
while ignoring fine-grained cross-modal relationships, e.g., short segments and
phrases or frames and words. In this paper, we introduce a hierarchical
cross-modal interaction (HCI) method for ATR by simultaneously exploring
clip-sentence, segment-phrase, and frame-word relationships, achieving a
comprehensive multi-modal semantic comparison. In addition, we present a novel
ATR framework that leverages auxiliary captions (AC) generated by a pretrained
captioner to perform feature interaction between audio and generated captions,
which yields enhanced audio representations and is complementary to the
original ATR matching branch. The audio and generated captions can also form
new audio-text pairs as data augmentation for training. Experiments show that
our HCI significantly improves the ATR performance. Moreover, our AC framework
also shows stable performance gains on multiple datasets.
Comment: Accepted by Interspeech202
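As a rough illustration of hierarchical cross-modal interaction, the sketch below scores an audio-caption pair at three granularities and fuses them. The pooling windows (seg_len, phrase_len), the max-then-mean aggregation, and the equal-weight fusion are all assumptions, and the projection layers that would map both modalities into a shared space are omitted.

import torch
import torch.nn.functional as F

def hierarchical_similarity(frames, words, seg_len=10, phrase_len=3):
    """Toy sketch: combine clip-sentence, segment-phrase and frame-word similarities.
    frames: (T, D) audio frame embeddings; words: (N, D) word embeddings,
    assumed to already live in a shared space."""
    # Clip-sentence: mean-pool both modalities, one cosine score
    clip_sent = F.cosine_similarity(frames.mean(0, keepdim=True),
                                    words.mean(0, keepdim=True)).squeeze()
    # Segment-phrase: pool non-overlapping windows, keep best-matching phrase per segment
    segs = torch.stack([s.mean(0) for s in frames.split(seg_len)])
    phrases = torch.stack([p.mean(0) for p in words.split(phrase_len)])
    seg_phr = F.cosine_similarity(segs.unsqueeze(1), phrases.unsqueeze(0),
                                  dim=-1).max(1).values.mean()
    # Frame-word: finest granularity, best word per frame, averaged over frames
    frame_word = F.cosine_similarity(frames.unsqueeze(1), words.unsqueeze(0),
                                     dim=-1).max(1).values.mean()
    # Equal-weight fusion is an assumption
    return (clip_sent + seg_phr + frame_word) / 3.0

frames = torch.randn(50, 256)   # hypothetical frame embeddings
words = torch.randn(12, 256)    # hypothetical word embeddings
print(hierarchical_similarity(frames, words).item())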
All you need is a second look: Towards Tighter Arbitrary shape text detection
Deep learning-based scene text detection methods have progressed
substantially over the past years. However, there remain several problems to be
solved. Generally, long curved text instances tend to be fragmented because of
the limited receptive field size of CNNs. Moreover, simple representations using
rectangle or quadrangle bounding boxes fall short when dealing with more
challenging arbitrary-shaped texts. In addition, the scale of text instances
varies greatly, which makes accurate prediction difficult for a single
segmentation network. To address these problems, we propose a two-stage
segmentation-based arbitrary-shape text detector named \textit{NASK}
(\textbf{N}eed \textbf{A} \textbf{S}econd loo\textbf{K}). Specifically,
\textit{NASK} consists of a Text Instance Segmentation network namely
\textit{TIS} (first stage), a Text RoI Pooling module, and a Fiducial pOint
eXpression module termed \textit{FOX} (second stage). First,
\textit{TIS} conducts instance segmentation to obtain rectangular text proposals
with a proposed Group Spatial and Channel Attention module (\textit{GSCA}) to
augment the feature expression. Then, Text RoI Pooling transforms these
rectangles to a fixed size. Finally, \textit{FOX} is introduced to
reconstruct text instances with a tighter representation using the
predicted geometrical attributes including text center line, text line
orientation, character scale and character orientation. Experimental results on
two public benchmarks, \textit{Total-Text} and \textit{SCUT-CTW1500},
demonstrate that the proposed \textit{NASK} achieves state-of-the-art
results.
Comment: 5 pages, 6 figures
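The abstract does not spell out the internals of \textit{GSCA}, so the following is only a generic spatial-plus-channel attention block of the kind such a module might build on; the reduction ratio, 7x7 spatial kernel, and sequential ordering are assumptions, and the group-wise design of the actual \textit{GSCA} is not reproduced.

import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Rough stand-in for a GSCA-style block: channel attention followed by
    spatial attention; NASK's actual grouping scheme is not reproduced here."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze spatial dims
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),      # score each position
            nn.Sigmoid(),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        x = x * self.channel_gate(x)                        # reweight channels
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.max(1, keepdim=True).values], dim=1)  # (B, 2, H, W)
        return x * self.spatial_gate(pooled)                # reweight positions

x = torch.randn(2, 32, 64, 64)
print(SpatialChannelAttention(32)(x).shape)                 # torch.Size([2, 32, 64, 64])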
A Global-local Attention Framework for Weakly Labelled Audio Tagging
Weakly labelled audio tagging aims to predict the classes of sound events
within an audio clip, where the onset and offset times of the sound events are
not provided. Previous works have used the multiple instance learning (MIL)
framework, and exploited the information of the whole audio clip by MIL pooling
functions. However, the detailed information of sound events such as their
durations may not be considered under this framework. To address this issue, we
propose a novel two-stream framework for audio tagging by exploiting the global
and local information of sound events. The global stream analyzes the
whole audio clip and uses a class-wise selection module to identify the local
clips that deserve attention. These clips are then fed to the local
stream to exploit the detailed information for a better decision. Experimental
results on AudioSet show that our proposed method can significantly improve
the performance of audio tagging under different baseline network
architectures.
Comment: Accepted to ICASSP202
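A toy sketch of the class-wise selection step described above: given frame-level class scores from the global stream, it returns fixed-length local clips around the highest-scoring frames for each class, to be passed to the local stream. The clip length, top-k choice, and boundary handling are illustrative assumptions.

import torch

def select_local_clips(frame_scores, clip_len=5, top_k=2):
    """Sketch: frame_scores is (T, num_classes); return (class, start, end)
    windows around each class's top-k frames for the local stream."""
    T, C = frame_scores.shape
    half = clip_len // 2
    clips = []
    for c in range(C):
        centers = frame_scores[:, c].topk(top_k).indices
        for t in centers.tolist():
            # Clamp the window so it stays inside the clip
            start = max(0, min(t - half, T - clip_len))
            clips.append((c, start, start + clip_len))
    return clips

scores = torch.rand(100, 10)            # toy frame-level class scores
print(select_local_clips(scores)[:3])   # e.g. [(0, 37, 42), (0, 81, 86), (1, 4, 9)]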
KCRC-LCD: Discriminative Kernel Collaborative Representation with Locality Constrained Dictionary for Visual Categorization
We consider the image classification problem via kernel collaborative
representation classification with locality constrained dictionary (KCRC-LCD).
Specifically, we propose a kernel collaborative representation classification
(KCRC) approach in which the kernel method is used to improve the discrimination
ability of collaborative representation classification (CRC). We then measure
the similarities between the query and atoms in the global dictionary in order
to construct a locality constrained dictionary (LCD) for KCRC. In addition, we
discuss several similarity measure approaches in LCD and further present a
simple yet effective unified similarity measure whose superiority is validated
in experiments. There are several appealing aspects associated with LCD. First,
LCD can be nicely incorporated under the framework of KCRC. The LCD similarity
measure can be kernelized under KCRC, which theoretically links CRC and LCD
under the kernel method. Second, KCRC-LCD scales well with both the
training set size and the feature dimension. An example shows that KCRC can
perfectly classify data with certain distributions where conventional CRC fails
completely. Comprehensive experiments on many public datasets also show that
KCRC-LCD is a robust discriminative classifier with both excellent performance
and good scalability, comparable to or outperforming many other
state-of-the-art approaches.
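For intuition, here is a minimal NumPy sketch of the kernel CRC step: the query is coded over all atoms with ridge-regularized coefficients in the induced feature space, and the class whose atoms yield the smallest reconstruction residual (expanded via the kernel trick) wins. The RBF kernel and the lam and gamma values are placeholders, and the locality-constrained dictionary (LCD) selection is omitted.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel between row vectors of A (n, d) and B (m, d)."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def kcrc_classify(X, labels, y, lam=1e-2, gamma=1.0):
    """Minimal kernel CRC sketch: code the query over all atoms in feature
    space, then assign the class whose atoms reconstruct it best."""
    K = rbf_kernel(X, X, gamma)                      # Gram matrix of atoms
    k_y = rbf_kernel(X, y[None, :], gamma).ravel()   # atom-query kernel values
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), k_y)   # ridge coding
    best, best_res = None, np.inf
    for c in np.unique(labels):
        a_c = np.where(labels == c, alpha, 0.0)      # keep only class-c coefficients
        # ||phi(y) - Phi a_c||^2 expanded via the kernel trick
        res = (rbf_kernel(y[None, :], y[None, :], gamma)[0, 0]
               - 2 * a_c @ k_y + a_c @ K @ a_c)
        if res < best_res:
            best, best_res = c, res
    return best

X = np.random.randn(40, 5)                  # toy dictionary atoms
labels = np.repeat(np.arange(4), 10)        # four classes, ten atoms each
print(kcrc_classify(X, labels, X[3]))       # likely class 0 for a class-0 atom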
SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification
In this paper, we present SpecAugment++, a novel data augmentation method for
deep neural network based acoustic scene classification (ASC). Different from
other popular data augmentation methods such as SpecAugment and mixup that only
work on the input space, SpecAugment++ is applied to both the input space and
the hidden space of the deep neural networks to enhance the input and the
intermediate feature representations. For an intermediate hidden state, the
augmentation techniques consist of masking blocks of frequency channels and
masking blocks of time frames, which improve generalization by enabling a model
to attend not only to the most discriminative parts of the feature but also to
the feature as a whole. Apart from using zeros for masking, we also examine two
approaches that mask with other samples from within the minibatch,
which introduces noise into the networks and makes them more discriminative
for classification. The experimental results on the DCASE 2018 Task 1 and
DCASE 2019 Task 1 datasets show that our proposed method obtains 3.6% and
4.7% accuracy gains, respectively, over a strong baseline without augmentation
(CP-ResNet), and outperforms previous data augmentation methods.
Comment: Submitted to Interspeech 202
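To make the hidden-space masking concrete, here is a small PyTorch sketch that masks one block of time frames and one block of frequency channels in an intermediate feature map, either with zeros or with the corresponding block from another sample in the minibatch. The block widths, single-block-per-sample policy, and the exact mixing rule are assumptions rather than the paper's recipe.

import torch

def specaugment_pp(h, time_width=8, freq_width=8, mode="zero"):
    """Sketch of hidden-space masking on a feature map h: (B, C, T, F).
    mode="zero" fills masked blocks with zeros; mode="swap" fills them from
    another minibatch sample (one interpretation of the mixing variants)."""
    B, C, T, F = h.shape
    out = h.clone()
    perm = torch.randperm(B)                    # partner samples for "swap"
    for b in range(B):
        t0 = torch.randint(0, max(T - time_width, 1), (1,)).item()
        f0 = torch.randint(0, max(F - freq_width, 1), (1,)).item()
        # Read fill values from the untouched input, not the masked output
        fill_t = 0.0 if mode == "zero" else h[perm[b], :, t0:t0 + time_width, :]
        fill_f = 0.0 if mode == "zero" else h[perm[b], :, :, f0:f0 + freq_width]
        out[b, :, t0:t0 + time_width, :] = fill_t   # mask a block of time frames
        out[b, :, :, f0:f0 + freq_width] = fill_f   # mask a block of freq channels
    return out

h = torch.randn(8, 64, 100, 40)             # toy intermediate hidden state
h_aug = specaugment_pp(h, mode="swap")
print(h_aug.shape)                          # torch.Size([8, 64, 100, 40])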